Abstract

We show that the problem of finding an optimal stochastic 'blind' 
controller in a Markov decision process is an NP-hard problem. The 
corresponding decision problem is NP-hard, in PSPACE, and SQRT-SUM-hard,
hence placing it in NP would imply a breakthrough in long-standing
open problems in computer science. Our optimization result establishes 
that the more general problem of stochastic controller optimization in 
POMDPs is also NP-hard. Nonetheless, we outline a special case that 
is solvable to arbitrary accuracy in polynomial time via semidefinite or 
second-order cone programming. 

Keywords: Partially observable Markov decision process, stochastic 
controller, bilinear program, computational complexity, Motzkin-Straus 
theorem, sum-of-square-roots problem, matrix fractional program,
semidefinite programming.
1 Introduction

Partially observable Markov decision processes (POMDPs) have proven to be a
valuable conceptual tool for problems throughout AI, including reinforcement
learning (Chrisman 1992), planning under uncertainty (Kaelbling et al. 1998),
and multiagent coordination (Bernstein et al. 2005). Briefly, a POMDP consists
of a Markov process over a set of states. The decision maker is unable to
perceive its current state directly, but must infer it based on indirect
observations. An important problem in this area is deciding how to select
actions to minimize cost given the state uncertainty. Unfortunately, this
problem is extremely challenging (Papadimitriou and Tsitsiklis 1987; Mundhenk et
al. 2000). In fact, the exact problem is unsolvable in the general case (Madani
et al. 1999).



An alternative to finding optimal policies for POMDPs is to find low-cost
controllers, that is, mappings from observation histories to actions (Sondik
1971; Platzman 1981). A restricted space of controllers can, in principle, be
considerably easier to search than the space of all possible policies (Littman
et al. 1998; Hansen 1998; Meuleau et al. 1999). Various methods for controller
optimization in POMDPs have been proposed in the literature, both for stochastic
as well as for deterministic controllers: exhaustive search (Smith 1971), branch
and bound (Hastings and Sadjadi 1979; Littman 1994), local search (Poupart and
Boutilier 2004; Serin and Kulkarni 2005), sequential quadratic programming
(Amato et al. 2007), or the EM algorithm (Toussaint et al. 2011).



A variety of complexity results are known for the problem of controller
optimization in POMDPs. Most versions are known to be hard for classes that are
believed to be above P (Papadimitriou and Tsitsiklis 1987; Mundhenk et al.
2000). The computational decision problem is: Given a restriction on the
controller and a target cost, can the target cost be achieved by a controller in
the class? Below, we consider several such controller classes.



Deterministic time/history-dependent controller. Such a controller selects an
action based on the current time period and/or the history of previous actions
and observations. The problem is NP-complete or PSPACE-complete, depending on
the variant (Papadimitriou and Tsitsiklis 1987; Mundhenk et al. 2000). In the
remaining classes below we assume stationary controllers.



Deterministic controller of polynomial size. Such a controller is represented by
a graph in which nodes are labeled with actions and edges are labeled with
observations. A deterministic controller can approximate the optimal policy for
any POMDP. The problem is in NP in that we can guess a controller of the right
size, then check whether it achieves no more than the target cost by solving a
system of linear equations. It is NP-hard even for the 'easier' completely
observable version (Littman et al. 1998).



Stochastic controller of polynomial size. This class extends deterministic
controllers by allowing a probability distribution over actions at each node.
There are POMDPs for which a stochastic controller of a given size can
outperform any deterministic controller of the same size (Singh et al. 1994). In
this paper we show that this problem is NP-hard, in PSPACE, and SQRT-SUM-hard,
hence showing it lies in NP would imply breakthroughs in long-standing open
problems (Allender et al. 2009; Etessami and Yannakakis 2010).



Deterministic memoryless controller. A memoryless controller chooses an action
based on the most recent observation only. These controllers are a special case
of deterministic controllers with polynomial size as they can be represented as a
graph with one node per observation. The problem has been shown to be
NP-complete (Littman 1994).



Stochastic memoryless controller. These controllers are defined by a probability
distribution over actions for each observation. They can be considerably more
effective than the corresponding deterministic memoryless controllers. They are a
generalization of the blind controllers we consider in this paper, and it follows
from our results that the problem is NP-hard, in PSPACE, and SQRT-SUM-hard.



Deterministic blind controller. A blind controller for a POMDP is equivalent to a
memoryless controller for an unobserved MDP. A deterministic blind controller
consists of a single action that is applied (blindly) regardless of the
observation history. It is straightforward to evaluate a deterministic blind
controller: simply drop all actions but one from the POMDP and evaluate the
resulting Markov chain. Thus, the decision problem for deterministic blind
controllers is trivially in P as an algorithm can simply try each action to see
which is best.

Stochastic blind controller. Such a controller is a probability distribution
over actions to be applied repeatedly at every timestep. This is the class of 
controllers we consider in this paper. Again, the added power of stochasticity 
allows for much more effective policies to be constructed. However, as we show 
in the remainder of this paper, the added power comes with a very high cost. 
The decision problem is NP-hard, in PSPACE, and SQRT-SUM-hard. 

2 MDPs and Blind Controllers 

We consider a discounted infinite-horizon Markov decision process (MDP) with
discount factor $\gamma < 1$, characterized by $n$ states and $k$ actions,
state-action costs (negative rewards) $c_{sa}$, and starting distribution
$\mu = (\mu_s)$ with $\mu_s \ge 0$ and $\sum_{s=1}^{n} \mu_s = 1$. Let
$p(\bar{s}|s,a)$ denote the probability to transition to state $\bar{s}$ when
action $a$ is taken at state $s$. The following linear program (LP) can be used
to find an optimal policy for the MDP:

$$\min_{x} \sum_{s,a} x_{sa} c_{sa} \tag{1}$$

$$\text{s.t.} \quad \sum_{a} x_{\bar{s}a} = (1-\gamma)\mu_{\bar{s}} + \gamma \sum_{s,a} p(\bar{s}|s,a)\, x_{sa} \quad \forall \bar{s}, \qquad x_{sa} \ge 0 \quad \forall s,a,$$

where $x_{sa}$ denotes the occupancy distribution over state-action pairs, and
the constraints are the Bellman flow (probability mass) constraints. From an
optimal occupancy $x^*_{sa}$, we can compute an optimal stationary and
deterministic policy that maps states to actions (Puterman 1994).



We consider now the case where we constrain the class of allowed policies to
stochastic 'blind' controllers in which the controller cannot observe or remember
anything (state, action, or time). Instead, the controller simply randomizes over
actions using the same distribution $\pi = (\pi_a)$ at each time step, where
$\pi \in \Delta$ and $\Delta = \{\pi : \pi \ge 0, \sum_{a=1}^{k} \pi_a = 1\}$ is
the standard probability simplex. Note that, contrary to standard MDP policies, a
blind controller $\pi$ is not a function of state. (The related notion of a
memoryless controller is a function of POMDP observations, but still not of
state.) Explicitly encoding the controller parametrization in (1) gives:

$$\min_{x,\, \pi \in \Delta} \sum_{s,a} x_s \pi_a c_{sa} \tag{2}$$

$$\text{s.t.} \quad x_{\bar{s}} = (1-\gamma)\mu_{\bar{s}} + \gamma \sum_{a} \pi_a \sum_{s} p(\bar{s}|s,a)\, x_s \quad \forall \bar{s},$$

where $x = (x_s)$ is an occupancy distribution over states, with $x \ge 0$. When
viewed as a function of both $x$ and $\pi$, the above constitutes a jointly
constrained bilinear program that is in general nonconvex in $(x, \pi)$
(Al-Khayyal and Falk 1983).
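
For a fixed controller, the constraint of the blind-controller program is linear in the occupancy vector, so the value of any given stochastic blind controller can be computed directly. The following is a minimal sketch under stated assumptions: the function names, the dense-list layout `P[a][s][t]`, the fixed iteration count, and the two-state example are all hypothetical choices made for illustration, not part of the paper.

```python
def blind_occupancy(P, mu, pi, gamma, iters=500):
    """Fixed-point iteration for the state occupancy x of a blind controller.

    P[a][s][t] is the probability of moving from state s to state t under
    action a (a hypothetical layout chosen for this sketch). The update is a
    contraction for gamma < 1; the iteration count is an arbitrary choice.
    """
    n, k = len(mu), len(pi)
    x = list(mu)
    for _ in range(iters):
        x = [
            (1 - gamma) * mu[t]
            + gamma * sum(pi[a] * P[a][s][t] * x[s]
                          for a in range(k) for s in range(n))
            for t in range(n)
        ]
    return x


def blind_value(P, C, mu, pi, gamma):
    """Objective J(pi) = sum over s, a of x_s * pi_a * c_sa."""
    x = blind_occupancy(P, mu, pi, gamma)
    return sum(x[s] * pi[a] * C[s][a]
               for s in range(len(mu)) for a in range(len(pi)))


# Illustrative two-state MDP: action 0 stays put, action 1 swaps the states.
P_example = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
C_example = [[1.0, 0.0], [0.0, 2.0]]
mu_example = [1.0, 0.0]
pi_example = [0.5, 0.5]
x_example = blind_occupancy(P_example, mu_example, pi_example, 0.9)
J_example = blind_value(P_example, C_example, mu_example, pi_example, 0.9)
```

Evaluating a fixed controller is easy; the difficulty discussed in this paper comes from optimizing jointly over the occupancy and the controller.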

Bilinear programs are known to be NP-hard to solve to global optimality
in general, but could there be some special structure in (2) that renders that
particular program tractable? In the next section, we answer this question for
the case where the MDP costs $c_{sa}$ depend nontrivially on both states and
actions, in which case we show that finding an optimal stochastic blind
controller is an NP-hard problem.

3 NP-hardness Result 

Let $C = (c_{sa})$ be the $n \times k$ matrix containing all state-action costs,
and $\mu = (\mu_s)$ be the $n \times 1$ starting distribution vector. The
decision version of our problem, henceforth called the STOCHASTIC-BLIND-POLICY
problem, asks, for a given MDP with discount factor $\gamma < 1$ and a given
target value $r$, whether there exists a stochastic blind controller $\pi$ that
achieves $J(\pi) \le r$, where $J(\pi) = x^\top C \pi$ is the value of controller
$\pi$ in (2) when the $n \times 1$ occupancy vector $x$ is defined via the
Bellman constraints in (2).

Theorem 1. The STOCHASTIC-BLIND-POLICY problem is NP-hard.

Proof. We reduce from the INDEPENDENT-SET problem. This problem asks, for a
given (undirected and with no self-loops) graph $G = (V, E)$ and a positive
integer $j \le |V|$, whether $G$ contains an independent set $V'$ having
$|V'| \ge j$. This problem is NP-complete, even when restricted to cubic planar
graphs (Garey and Johnson 1979).



Let $\mathbf{G}$ be the $n \times n$ (symmetric, 0-1) adjacency matrix of an
input cubic graph $G$. The reduction constructs an MDP with $n$ states and $n$
actions, uniform starting distribution $\mu$, cost matrix
$C = \frac{1}{\gamma}(\mathbf{G} + I)$ where $I$ is the identity matrix, and
deterministic transitions $p(\bar{s}|s,a) = 1$ if $\bar{s} = a$ and 0 otherwise
(where the action variable $a$ can be viewed as indexing the state space). Since
the transitions $p(\bar{s}|s,a)$ are independent of $s$, the occupancy vector in
(2) reduces to $x = (1-\gamma)\mu + \gamma\pi$, and the value function becomes
the quadratic

$$J(\pi) = \frac{4(1-\gamma)}{n\gamma} + \pi^\top (\mathbf{G} + I)\pi, \tag{3}$$

where we used the fact that the input graph is cubic (each node has degree
three) and $\mu$ is uniform. The Motzkin-Straus theorem (Motzkin and Straus
1965) states that

$$\frac{1}{\alpha(G)} = \min_{\pi \in \Delta} \pi^\top (\mathbf{G} + I)\pi, \tag{4}$$

where $\alpha(G)$ is the size of the maximum independent set (the stability
number) of the graph. Let the target value be
$r = \frac{1}{j} + \frac{4(1-\gamma)}{n\gamma}$. Then, $J(\pi) \le r$ is
equivalent to $\pi^\top(\mathbf{G} + I)\pi \le \frac{1}{j}$, and hence from (4)
follows that the existence of a vector $\pi$ that satisfies $J(\pi) \le r$ would
imply $\frac{1}{\alpha(G)} \le \frac{1}{j}$, and hence $\alpha(G) \ge j$ or, in
other words, $|V'| \ge j$ for some $V'$. Conversely, if $\alpha(G) \ge j$, a
minimizer of (4) satisfies $J(\pi) \le r$. □
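
The reduction can be checked numerically on a small cubic graph. The sketch below is illustrative only: it assumes the cost matrix is the adjacency matrix plus identity scaled by 1/gamma, uses the complete bipartite graph K_{3,3} (cubic, with maximum independent set of size 3) as a hypothetical input, and all function and variable names are made up for this example.

```python
def theorem1_value(G, pi, gamma):
    """Value J(pi) = x^T C pi for the MDP built from adjacency matrix G."""
    n = len(G)
    mu = [1.0 / n] * n
    # Assumed cost matrix: C = (G + I) / gamma.
    C = [[(G[s][a] + (1 if s == a else 0)) / gamma for a in range(n)]
         for s in range(n)]
    # Action a always moves to state a, so x = (1 - gamma) * mu + gamma * pi.
    x = [(1 - gamma) * mu[s] + gamma * pi[s] for s in range(n)]
    return sum(x[s] * C[s][a] * pi[a] for s in range(n) for a in range(n))


def quadratic_form(G, pi):
    """The Motzkin-Straus objective pi^T (G + I) pi."""
    n = len(G)
    return sum(pi[s] * (G[s][a] + (1 if s == a else 0)) * pi[a]
               for s in range(n) for a in range(n))


# K_{3,3}: nodes 0..2 on one side, 3..5 on the other; every node has degree 3.
n_nodes = 6
G33 = [[1 if (s < 3) != (a < 3) else 0 for a in range(n_nodes)]
       for s in range(n_nodes)]
gamma_ex = 0.9
offset = 4 * (1 - gamma_ex) / (n_nodes * gamma_ex)

pi_indep = [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0]   # uniform on an independent set
pi_unif = [1 / 6] * 6                              # uniform on all nodes

target_r = 1 / 3 + offset                          # target value for j = 3
J_indep = theorem1_value(G33, pi_indep, gamma_ex)
J_unif = theorem1_value(G33, pi_unif, gamma_ex)
```

In this example the controller that spreads its mass uniformly over one side of the bipartition meets the target for j = 3, while the uniform controller over all nodes does not.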



4 On the Complexity Upper Bound 

Our STOCHASTIC-BLIND-POLICY problem is contained in PSPACE, as it can be
expressed as a system of polynomial inequalities, and any such system is known
to be solvable in PSPACE (Canny 1988). But is there a tighter upper bound?



We will attempt to address this question indirectly, by establishing a
connection between the STOCHASTIC-BLIND-POLICY problem and the SQRT-SUM problem.
The SQRT-SUM problem asks, for a given list of integers $c_1, \ldots, c_n$ and
an integer $d$, whether $\sum_{i=1}^{n} \sqrt{c_i} \le d$. The problem is
conjectured to lie in P, but it is not even known to lie in NP. The difficulty
of obtaining an exact complexity for this problem has been recognized for at
least 35 years (Garey et al. 1976). Allender et al. (2009) showed that SQRT-SUM
lies in the 4th level of the Counting Hierarchy, and Etessami and Yannakakis
(2010) showed that SQRT-SUM reduces to the problem of approximating 3-player
Nash equilibria. Here, we show that STOCHASTIC-BLIND-POLICY is at least as hard
as SQRT-SUM, hence a result that would place STOCHASTIC-BLIND-POLICY in NP would
resolve several open problems in computer science (Allender et al. 2009;
Etessami and Yannakakis 2010).
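
To give a feel for why SQRT-SUM is delicate, the sketch below tries to decide the inequality with a fixed number of fractional bits, using exact integer square roots. It is an illustration only, not an algorithm from the cited papers; the function name and the three-valued return convention are assumptions made for this example. The open question is how many bits are needed in the worst case.

```python
from math import isqrt


def sqrt_sum_at_precision(cs, d, p):
    """Try to decide sum(sqrt(c) for c in cs) <= d using p fractional bits.

    Returns True or False when the integer bounds settle the question, and
    None when p bits are not enough to separate the sum from d.
    """
    scale = 1 << p
    # isqrt(c * 4**p) equals floor(sqrt(c) * 2**p), a lower bound on each term.
    lower = sum(isqrt(c << (2 * p)) for c in cs)
    # Each scaled term is strictly below its floor plus one.
    upper = lower + len(cs)
    if upper <= d * scale:
        return True
    if lower > d * scale:
        return False
    return None
```

For instance, with inputs 2 and 3 and bound 3, one fractional bit leaves the question undecided while four bits suffice; with inputs 4 and 9 and bound 5 (an exact tie), these simple bounds never separate the two sides at any precision.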



Theorem 2. The STOCHASTIC-BLIND-POLICY problem is SQRT-SUM-hard.

Proof. Let $c_1, \ldots, c_n$ and $d$ be the inputs of SQRT-SUM. The reduction
constructs an MDP with $n+1$ states and $n$ actions, where the $(n+1)$th state
is absorbing (self-looping). The starting probabilities are
$\mu_i = \frac{1}{n}$ for states $i = 1, \ldots, n$ and $\mu_{n+1} = 0$, and the
costs depend only on state and are given by the inputs $c_i$ for states
$i = 1, \ldots, n$ and $c_{n+1} = 0$. From each state $i = 1, \ldots, n$, the
$i$th action deterministically stays at state $i$ while all other actions
deterministically transition to the absorbing state $n+1$.

For each state $i = 1, \ldots, n$, the Bellman occupancy constraint reads
$x_i = (1-\gamma)/n + \gamma\pi_i x_i$, and the value function becomes:

$$J(\pi) = \sum_{i=1}^{n} c_i x_i = \frac{1-\gamma}{n} \sum_{i=1}^{n} \frac{c_i}{1 - \gamma\pi_i}. \tag{5}$$

Differentiating $J$ with respect to $\pi$ after introducing a Lagrange
multiplier $\lambda$ for the constraint $\sum_i \pi_i = 1$, and setting to zero,
gives an equation that involves $\lambda$ and $\pi_i$. We can eliminate
$\lambda$ from that equation by solving for each $\pi_i$ and then using
$\sum_i \pi_i = 1$, resulting in an optimal multiplier

$$\lambda^* = \frac{\gamma(1-\gamma)}{n(n-\gamma)^2} \Big( \sum_{i=1}^{n} \sqrt{c_i} \Big)^2. \tag{6}$$

Substituting in (5) the (irrational) $\pi^*$ corresponding to $\lambda^*$ we get
the optimal value:

$$J^* = \frac{1-\gamma}{n(n-\gamma)} \Big( \sum_{i=1}^{n} \sqrt{c_i} \Big)^2. \tag{7}$$

The STOCHASTIC-BLIND-POLICY question of whether there exists a stochastic blind
controller $\pi$ with value $J(\pi) \le r$ is clearly equivalent to the question
whether $J^* \le r$. By choosing $r = \frac{(1-\gamma)d^2}{n(n-\gamma)}$, we see
from (7) that the condition $J^* \le r$ is equivalent to
$\sum_{i=1}^{n} \sqrt{c_i} \le d$, and the reduction is complete. □
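
The closed forms in this proof can be sanity-checked numerically. The sketch below is a hedged illustration: the function names and the input list are hypothetical, and the example is chosen so that the stationary point of the Lagrangian has nonnegative entries and is therefore a valid controller (an assumption of this sketch, checked in the code rather than guaranteed for arbitrary inputs).

```python
from math import sqrt


def sqrt_sum_reduction_value(c, pi, gamma):
    """J(pi) = (1 - gamma)/n * sum_i c_i / (1 - gamma * pi_i)."""
    n = len(c)
    return (1 - gamma) / n * sum(ci / (1 - gamma * p) for ci, p in zip(c, pi))


def stationary_controller(c, gamma):
    """Stationary point of the Lagrangian: 1 - gamma*pi_i proportional to sqrt(c_i)."""
    n = len(c)
    total = sum(sqrt(ci) for ci in c)
    return [(1 - sqrt(ci) * (n - gamma) / total) / gamma for ci in c]


def closed_form_value(c, gamma):
    """J* = (1 - gamma) / (n (n - gamma)) * (sum_i sqrt(c_i))^2."""
    n = len(c)
    return (1 - gamma) / (n * (n - gamma)) * sum(sqrt(ci) for ci in c) ** 2


def target_value(n, d, gamma):
    """Target r = (1 - gamma) d^2 / (n (n - gamma)) used in the reduction."""
    return (1 - gamma) * d * d / (n * (n - gamma))


c_ex = [4, 9, 16]            # hypothetical SQRT-SUM input; sqrt sum is 9
gamma_ex = 0.9
pi_star = stationary_controller(c_ex, gamma_ex)
J_star = sqrt_sum_reduction_value(c_ex, pi_star, gamma_ex)
J_uniform = sqrt_sum_reduction_value(c_ex, [1 / 3] * 3, gamma_ex)
```

With these inputs the value at the stationary controller agrees with the closed form, and comparing it against the target for d = 10 or d = 8 gives the same answer as comparing the square-root sum with d directly.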

5 A Special Case that is in P 

We outline here a special case that is solvable to arbitrary accuracy in polyno- 
mial time via semidefinite or second-order cone programming, and a variant in 
which the exact optimal solution can be computed in polynomial time. 

For each action $a$, let $P_a$ denote the corresponding MDP transition matrix,
$P_a(s, \bar{s}) = p(\bar{s}|s, a)$. The special case assumes that each matrix
$P_a$ is symmetric (and therefore doubly stochastic). The bilinear program (2)
then reads:

$$\min_{\pi \in \Delta} \; (1-\gamma)\, \pi^\top C^\top (I - \gamma M_\pi)^{-1} \mu, \tag{8}$$

where $M_\pi = \sum_a \pi_a P_a$.

Lemma 1. For any $\pi$, the matrix $I - \gamma M_\pi$ is positive definite.

Proof. Since each matrix $P_a$ is symmetric and stochastic, all its eigenvalues
are real and satisfy $\lambda(P_a) \le 1$. Hence, the eigenvalues of
$I - \gamma P_a$ are also real and satisfy
$\lambda(I - \gamma P_a) = 1 - \gamma\lambda(P_a) > 0$ because $\gamma < 1$.
Therefore, $I - \gamma P_a$ is a positive definite matrix, and so must be the
matrix $I - \gamma M_\pi$ as it can be written as the convex combination (over
$\pi$) of positive definite matrices. □
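
Lemma 1 can be spot-checked on a small instance by attempting a Cholesky factorization, which succeeds exactly when a symmetric matrix is positive definite. This is a sketch under stated assumptions: the three symmetric stochastic matrices, the mixing weights, and the helper names are hypothetical choices for illustration.

```python
from math import sqrt


def cholesky(A):
    """Return the lower-triangular Cholesky factor of symmetric A,
    or None if A is not (numerically) positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                if s <= 0:
                    return None
                L[i][j] = sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def mixed_matrix(Ps, pi, gamma):
    """I - gamma * M_pi, with M_pi the pi-weighted sum of the matrices in Ps."""
    n = len(Ps[0])
    return [[(1.0 if s == t else 0.0)
             - gamma * sum(p * P[s][t] for p, P in zip(pi, Ps))
             for t in range(n)] for s in range(n)]


# Three symmetric stochastic matrices on three states (illustrative).
Ps_ex = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]],
]
A_ex = mixed_matrix(Ps_ex, [0.2, 0.3, 0.5], 0.95)
factor_ex = cholesky(A_ex)
```

A successful factorization on one instance is of course only a consistency check, not a substitute for the proof.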

If we constrain the feasible region to those $\pi$ for which
$C\pi = \kappa\mu$, with $\kappa \in \mathbb{R}$, then we can formulate the
program (8) as a matrix fractional program, which, by taking epigraph and a
Schur complement, and using Lemma 1, can be expressed as a convex program
involving a linear matrix inequality and linear constraints:

$$\min_{t \in \mathbb{R},\, \kappa \in \mathbb{R},\, \pi \in \Delta} t \quad \text{s.t.} \quad \begin{bmatrix} I - \gamma M_\pi & \mu \\ \mu^\top & t \end{bmatrix} \succeq 0, \quad M_\pi = \sum_a \pi_a P_a, \quad C\pi = \kappa\mu, \tag{9}$$

which can be solved efficiently to arbitrary accuracy by semidefinite
programming or second-order cone programming (Boyd and Vandenberghe 2004).

If we further assume that the costs are nonpositive and satisfy
$C = -\kappa\mu\mathbf{1}^\top$, with $\kappa > 0$, then (8) becomes a
minimization of a concave function over the probability simplex, hence an
optimum is attained at a vertex of the simplex and the optimal controller will
be deterministic. Since there are only $k$ deterministic controllers, evaluating
each of them and selecting the optimal one takes $O(kn^3)$ operations.
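
The enumeration over deterministic controllers can be sketched as follows, with one linear solve per action. This is a minimal illustration under the assumptions of this special case (symmetric transition matrices and costs of the form minus kappa times mu times the all-ones row vector); the function names and the two-state example are hypothetical.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y


def special_case_value(Ps, mu, pi, kappa, gamma):
    """J(pi) = -(1 - gamma) * kappa * mu^T (I - gamma M_pi)^{-1} mu."""
    n = len(mu)
    A = [[(1.0 if s == t else 0.0)
          - gamma * sum(p * P[s][t] for p, P in zip(pi, Ps))
          for t in range(n)] for s in range(n)]
    y = solve(A, mu)
    return -(1 - gamma) * kappa * sum(m * yi for m, yi in zip(mu, y))


def best_deterministic(Ps, mu, kappa, gamma):
    """Evaluate each single-action controller and return (best action, values)."""
    k = len(Ps)
    values = []
    for a in range(k):
        e_a = [1.0 if b == a else 0.0 for b in range(k)]
        values.append(special_case_value(Ps, mu, e_a, kappa, gamma))
    return min(range(k), key=values.__getitem__), values


# Illustrative instance: action 0 stays put, action 1 swaps the two states.
Ps_ex = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
mu_ex = [0.8, 0.2]
best_action, values_ex = best_deterministic(Ps_ex, mu_ex, 1.0, 0.9)
J_mix = special_case_value(Ps_ex, mu_ex, [0.5, 0.5], 1.0, 0.9)
```

In this instance the even mixture of the two actions is no better than the best single action, as the concavity argument predicts.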

6 Conclusions 

In response to the computational intractability of searching for optimal policies 
in POMDPs, many researchers have turned to finite-state controllers as a more 
tractable alternative. We have provided here a computational characterization 
of exactly solving problems in the class of stochastic controllers, showing that 
(1) they are NP-hard, (2) they are in PSPACE, and (3) they are SQRT-SUM-hard, 
hence showing membership in NP would resolve long-standing open problems. 

We note that our NP-hardness proof relies on the assumption that the costs
$c_{sa}$ are nondegenerate functions of both state and action. We have been
unable to extend the NP-hardness proof to the case where the costs are functions
of state only. Although the proof of SQRT-SUM-hardness employs such costs, no
hardness result above polynomial time is known for SQRT-SUM, leaving the
complexity of the case of state-only-dependent costs of the stochastic blind
controller problem open.

In this work, we only addressed the complexity of the decision problem for
the discounted infinite-horizon case. There are several open questions, in
particular the complexity of approximate optimization for this class of
stochastic controllers. The related literature addresses only the case of
deterministic controllers (Lusena et al. 2001).